7 research outputs found

    Spanish Resource Grammar version 2023

    We present the latest version of the Spanish Resource Grammar (SRG). The new SRG uses the most recent version of the FreeLing morphological analyzer and tagger and is accompanied by a manually verified treebank and a list of documented issues. We also present the grammar's coverage and overgeneration on a small portion of a learner corpus, an entirely new line of research for the SRG. The grammar can be used for linguistic research, such as empirically driven development of syntactic theory, and in natural language processing applications such as computer-assisted language learning. Finally, as the treebanks grow, they can be used for training high-quality semantic parsers and other systems that benefit from precise and detailed semantics. Comment: 10 pages, 4 figures
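Coverage and overgeneration, as used here, can be read as simple ratios over a labelled test suite: the share of grammatical items that receive at least one parse, and the share of ungrammatical items that (incorrectly) do. A minimal sketch of that bookkeeping, where the item format is an assumption for illustration and not part of the SRG release:

```python
def grammar_metrics(results):
    """Compute coverage and overgeneration for a precision grammar.

    `results` is an iterable of (is_grammatical, n_parses) pairs,
    one per test-suite item, where n_parses is the number of
    analyses the grammar assigned to the sentence.

    coverage       = parsed grammatical items / all grammatical items
    overgeneration = parsed ungrammatical items / all ungrammatical items
    """
    gram = [n for g, n in results if g]
    ungram = [n for g, n in results if not g]
    cov = sum(1 for n in gram if n > 0) / len(gram) if gram else 0.0
    over = sum(1 for n in ungram if n > 0) / len(ungram) if ungram else 0.0
    return cov, over
```

For example, a suite with two grammatical items (one parsed) and two ungrammatical items (one parsed) yields coverage 0.5 and overgeneration 0.5.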

    Revisiting Supertagging for HPSG

    We present new supertaggers trained on HPSG-based treebanks. These treebanks feature high-quality annotation based on a well-developed linguistic theory and include diverse and challenging test datasets, beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based methods and show that both SVM and neural supertaggers achieve considerably higher accuracy than the baseline. Our fine-tuned BERT-based tagger achieves 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb). We conclude that it therefore makes sense to integrate these new supertaggers into modern HPSG parsers, and we also hope that the diverse and difficult datasets we used here will gain more popularity in the field. We contribute the complete dataset reformatted for token classification. Comment: 9 pages
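Reformatting a supertagged treebank for token classification amounts to pairing each token with an integer supertag id, the parallel-sequence format standard taggers consume. A minimal sketch of that conversion (the example tag names below are invented for illustration; the actual HPSG lexical types come from the treebanks):

```python
def build_tag_vocab(tagged_sentences):
    """Map every supertag occurring in the data to an integer id."""
    tags = sorted({t for _, tags in tagged_sentences for t in tags})
    return {t: i for i, t in enumerate(tags)}

def encode(tagged_sentences, tag2id):
    """Turn (tokens, supertags) pairs into (tokens, tag_ids) pairs,
    the parallel-sequence format used for token classification."""
    return [(toks, [tag2id[t] for t in tags])
            for toks, tags in tagged_sentences]
```

From here, the id sequences can be fed to any sequence labeller (CRF, or a BERT-style model with subword-to-token alignment handled by the tokenizer).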

    Assembling Syntax: Modeling Constituent Questions in a Grammar Engineering Framework

    Thesis (Ph.D.)--University of Washington, 2021. This dissertation is dedicated to a cross-linguistic account of constituent (aka wh-) questions as part of a grammar engineering toolkit, the Grammar Matrix, couched in the Head-driven Phrase Structure Grammar (HPSG) formalism. The main research question is: What, in formal grammar terms, constitutes an analysis of the various attested ways to form constituent questions which is demonstrably compatible with analyses of other phenomena that also vary typologically? I assume here a working definition of "analysis" as a set of HPSG types, including lexical and phrasal, and ways in which these general types vary depending on a given language. By "varying typologically" I mean that just as the analyses presented here were driven by a review of the typological literature on constituent questions, the interacting analyses that are part of the Grammar Matrix were driven by the typology of other phenomena. My research question is related to a big question in linguistics: What is the range of possible variation of human languages? Specifically, this work aims to contribute to this big question by providing a set of analyses which are (i) driven by typological surveys; (ii) demonstrably integrated with existing analyses; and (iii) rigorously tested. Thus, while not a claim about possibilities and impossibilities, this work is a step towards establishing a range of specific linguistic analyses which are consistently successful across languages. I test the analyses in terms of coverage, overgeneration, and ambiguity with respect to test suites which include constituent questions along with other syntactic phenomena and come from typologically and genealogically diverse languages. I look in particular detail into Russian, for which I compile a test suite of 273 sentences including various types of simple and complex declarative and interrogative clauses.
    I additionally evaluate the system on five "held-out" languages, all from different language families, which I did not consider at all during development. On the theoretical level, I conclude that the HPSG filler-gap construction, in combination with non-local features such as SLASH and QUE, provides a functional basis for cross-linguistic modeling of obligatory question phrase fronting in main clauses, but it is not yet fully clear whether these are sufficient to model the contrast between clause-embedding predicates meaning e.g. "think" and "ask" cross-linguistically. I conclude also that question phrase fronting which seems optional on the surface is difficult to formally model as such, which suggests it could be more readily analyzed as a combination of obligatory fronting with any material appearing in front of the question word licensed by a separate information structure fronting mechanism. I furthermore conclude that "lexical threading", the HPSG mechanism by which lexical heads project their arguments' nonlocal features, complicates the analysis of fronting, and that the entire Grammar Matrix system can be reasoned about more simply without the lexical threading assumption, although interrogative morphology can be modeled more straightforwardly with that assumption. On the grammar engineering level, I conclude that the existing Grammar Matrix system, with its lexicon, morphotactics, polar questions, and case libraries, can be successfully extended to support an analysis of constituent questions. The Grammar Matrix's information structure library, however, would require more substantial revisions in order to be integrated with an analysis of constituent questions, especially to support data from languages with flexible word order and data with embedded clauses from all languages.
    At the level of the DELPH-IN HPSG formalism, I conclude that the recently suggested append list type can be conveniently used for modeling question phrase fronting in place of the cumbersome difference list append. Finally, on the methodological level, I conclude that using at least one larger test suite with more complex sentences during Grammar Matrix development (along with multiple smaller test suites for typological diversity) involves a cost to typological breadth and a danger of "overfitting" the cross-linguistic system to one language, but it is still important for uncovering issues in the analysis which would otherwise be ignored.
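For readers unfamiliar with the mechanism: a difference list makes concatenation cheap by keeping the list's tail "open", so two lists can be joined without traversing the first. A rough functional analogue of the idea (Hughes-style lists in Python; this is an illustration only, not DELPH-IN TDL, where the append is expressed through feature-structure unification):

```python
def dlist(items):
    """Represent a list as a function that prepends its items
    to whatever tail it is eventually given -- the 'open tail'."""
    items = list(items)
    return lambda rest: items + rest

def dappend(d1, d2):
    """Append two difference lists by composition: d2 becomes
    the tail of d1, with no traversal at append time."""
    return lambda rest: d1(d2(rest))

def to_list(d):
    """Close off the open tail with [] to recover a plain list."""
    return d([])
```

For instance, `to_list(dappend(dlist([1, 2]), dlist([3])))` yields `[1, 2, 3]`.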

    20 years of the Grammar Matrix: cross-linguistic hypothesis testing of increasingly complex interactions

    The Grammar Matrix project is a meta-grammar engineering framework expressed in Head-driven Phrase Structure Grammar (HPSG) and Minimal Recursion Semantics (MRS). It automates grammar implementation and is thus a tool and a resource for linguistic hypothesis testing at scale. In this paper, we summarize how the Grammar Matrix grew in the last decade and describe how new additions to the system have made it possible to study interactions between analyses, both monolingually and cross-linguistically, at new levels of complexity.

    AGGREGATION

    This archive is associated with the AGGREGATION project, which seeks to automatically generate HPSG grammars on the basis of Interlinear Glossed Text data. For a detailed description of this project, see Chapter 3 of Inferring Grammars from Interlinear Glossed Text: Extracting Typological and Lexical Properties for the Automatic Generation of HPSG Grammars, PhD thesis by Kristen Howell, 2020. This archive includes the following:
    - The AGGREGATION/BASIL syntactic inference repository from https://git.ling.washington.edu/agg/aggregation
    - The MOM morphological inference repository from https://git.ling.washington.edu/agg/mom
    - The Xigt framework for eXtensible Interlinear Glossed Text, release 1.1, from https://github.com/xigt/xigt
    - The Grammar Matrix Customization system: http://matrix.ling.washington.edu/index.html
    - Code, dependencies, and sample data for running the AGGREGATION pipeline end to end.
    The AGGREGATION Project aims to bring the benefits of grammar engineering to language documentation without requiring field linguists to become grammar engineers. We achieve this by automatically creating precision grammars on the basis of analyses and annotations already produced by field linguists, together with a typologically grounded cross-linguistic grammar resource (the LinGO Grammar Matrix) and natural language processing techniques developed for high-resource languages. Precision grammars are machine-readable encodings of mutually consistent linguistic hypotheses, in our case concerning morphotactics, morphosyntax, and the syntax-semantics interface. They can be used to automatically process text, assigning structures to input strings and strings to input semantic representations. Text processed in this way can then be searched for sentences or word forms with structures of interest, or for items that are not covered by the grammar (i.e., fall outside current hypotheses). Funding: National Science Foundation Grant No. BCS-1160274 (PI Bender) and Grant No. BCS-1561833 (PI Bender).